Introduction and literature review

The National Basketball Association (NBA) season is an annual event followed by basketball fans around the world. Some NBA players are extraordinarily popular worldwide and earn very high salaries: one of the most popular stars, Stephen Curry, received over $34.7 million in the 2017-18 season. Popularity, however, varies widely from player to player. To improve the effect of our marketing and raise profits, our company should use data to identify athletes who are highly popular on social media platforms and sign them to endorsement deals. What are the key factors that drive differences in popularity? Do more skilled NBA players tend to be more popular? Are there non-skill factors that also affect a player’s influence? In this report, I analyze three factors, both skill-related and non-skill-related, that could predict NBA players’ popularity on social media: the player’s age, the player’s average number of points per game, and the percentage of games won by the player’s team in the season.

Literature review

Previous studies have found that skill-related features are statistically significant in evaluating a player’s value. These include years of experience (years in the league), personal performance measures such as 3-point shots, assists, rebounds and fouls, and team performance such as the winning percentage in a season. Studies also argue that non-skill factors such as the player’s age, nationality, and personal marketing and business skills may contribute to differences in public fame, which is measured by the number of followers on social media platforms. According to a previous article, the estimated average peak age of an NBA player is around 27. A player’s age can be interpreted as an index of his years of experience and, more importantly, of his accumulated exposure to the public. Age also matters when deciding whether to sign an athlete for the long run. In addition, studies suggest that, besides the player’s personal performance, how popular and well known his team is plays a large role in his public fame. Inspired by these studies, I expect that the player’s age, the player’s average number of points per game and the percentage of games won by the player’s team in the season are three key predictor variables that account for differences in popularity. To measure popularity, I use the player’s number of Twitter followers, which is the response variable in this analysis.

To choose the predictor variables, I used a scatter plot matrix of all the potential predictor variables and the response variable to see which pairs of variables are associated and which are barely related. The matrix also helps identify predictors that are not highly correlated with one another, which should improve the fit of the full multiple regression model. The chosen predictors AGE, PTS_PG and W_PCT all show some association with the response variable TWITTER_FOLLOWER_COUNT_MILLIONS, but the plots also show clear extreme outliers, so a re-expression is needed before further investigation. The correlations between each pair of predictors are quite weak (their scatter plots show nearly random patterns). Each predictor can therefore contribute predictive information that the others do not already carry, which makes it worthwhile to add to the multiple regression model.

Summary data

I analyzed a data set of 93 NBA players in the 2017-18 NBA season, and I assume that the data collection process was random. The sample provides information about both skill-related and non-skill-related factors, including each player’s team name, age, salary paid by the team, whether he is from the US, his offensive and defensive performance, his average number of points per game, and the percentage of games won by his team in the season.

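| Variable | Min | Q1 | Median | Mean | Q3 | Max | Standard deviation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Popularity (millions of Twitter followers) | 0.002 | 0.064 | 0.252 | 1.629 | 0.912 | 37.000 | 4.487266 |
| Age (years) | 20.00 | 24.00 | 27.00 | 27.41 | 29.00 | 39.00 | 4.030418 |
| Points per game | 1.50 | 9.40 | 14.60 | 15.21 | 21.10 | 31.60 | 7.368311 |
| Winning % (as a proportion) | 0.000 | 0.417 | 0.507 | 0.509 | 0.630 | 0.824 | 0.1616416 |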

  1. Response variable: TWITTER_FOLLOWER_COUNT_MILLIONS: The player’s number of Twitter followers, in millions, representing the player’s popularity. The distribution is strongly right skewed, with several extreme outliers on the high end. The median is 0.252 and the mean is 1.629; the mean is pulled up strongly by the outliers. The standard deviation is 4.487. A log re-expression is therefore helpful: after taking the log, the distribution is nearly normal.

  2. Predictor variable 1: AGE: The age of the player, in years. The distribution is nearly normal, with a mean of about 27 years and a standard deviation of about 4 years. From the player’s age, we can estimate his accumulated exposure to the public and his years of experience.

  3. Predictor variable 2: PTS_PG: The player’s average number of points per game. The distribution is symmetric, with a mean of about 15 points and a large standard deviation of 7.368 points, which means the distribution is spread out.

  4. Predictor variable 3: W_PCT: The percentage of games won by the player’s team in the season, recorded as a proportion between 0 and 1. The distribution is nearly normal, with a few outliers on the left side. The mean is around 0.5, and the standard deviation is 0.16.
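The effect of the log re-expression on TWITTER_FOLLOWER_COUNT_MILLIONS can be checked from the quartiles in the summary table alone. The sketch below is an illustrative addition, not part of the original analysis. It uses the quartile (Bowley) skewness coefficient, (Q3 + Q1 - 2 × median) / (Q3 - Q1), which is 0 when the quartiles are symmetric about the median. The function name `bowley_skewness` is hypothetical. The coefficient is the same for any log base, so no base has to be assumed.

```python
import math

def bowley_skewness(q1, median, q3):
    """Quartile skewness: 0 when Q1 and Q3 are equidistant from the median."""
    return (q3 + q1 - 2 * median) / (q3 - q1)

# Quartiles of TWITTER_FOLLOWER_COUNT_MILLIONS from the summary table.
q1, median, q3 = 0.064, 0.252, 0.912

raw_skew = bowley_skewness(q1, median, q3)

# The log is monotone, so the quartiles of log(y) are, to a good
# approximation, the logs of the quartiles of y.
log_skew = bowley_skewness(math.log10(q1), math.log10(median), math.log10(q3))

print(f"raw scale: {raw_skew:.3f}")
print(f"log scale: {log_skew:.3f}")
```

On the raw scale the coefficient is well above zero (about 0.56), while on the log scale it is close to zero, which agrees with the description of a strongly right-skewed variable that becomes roughly symmetric after the log re-expression.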

Regression Interpretation

More than one factor is associated with popularity, so I build a multiple regression model to predict popularity, measured by the player’s number of Twitter followers (in millions), from the player’s age, points per game and the team’s winning percentage. I checked the scatter plot of each of the three predictor variables against the log of the response variable. All three show a straight, positive and moderately strong relationship with the re-expressed response variable, so each predictor is a meaningful candidate for the multiple regression model. I also calculated the correlation between each pair of predictor variables, and all of them are quite weak. I can therefore go on to check the conditions for the multiple regression model.
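The pairwise correlation check described above could be sketched as follows. This is a minimal illustration, not the code used for the report: the helper names `pearson` and `pairwise_correlations` are hypothetical, and the short columns at the bottom are toy placeholder values, not rows from the NBA data set.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pairwise_correlations(columns):
    """Return {(name_a, name_b): r} for every pair of named columns."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }

# Toy placeholder values (NOT the real data), only to show the call.
toy = {
    "AGE": [22, 25, 27, 30, 33],
    "PTS_PG": [12.0, 20.5, 8.3, 15.1, 10.4],
    "W_PCT": [0.61, 0.35, 0.52, 0.44, 0.58],
}
for pair, r in pairwise_correlations(toy).items():
    print(pair, round(r, 2))
```

With the real columns in place of the toy values, correlations close to 0 for every predictor pair would support the statement that the predictors are only weakly correlated with one another.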

Regression diagnostics

Conditions for multiple regression:

  1. Quantitative Variable Condition: The response variable and all the predictor variables are quantitative.

  2. Straight Enough Condition (Linearity Assumption): The scatter plots of each predictor variable against the log of the response variable are all straight enough for a linear regression model to make sense. I also plotted the residuals against the predicted values, and the plot looks patternless. The histogram of the residuals is roughly symmetric, without extreme outliers.

  3. Randomization Condition: I assume that the data were collected at random from all NBA players and that the sample is representative of the population of players.

  4. Does the plot thicken? Condition: The residual plot shows no problematic “fan shape”; the residuals are spread roughly equally across the predicted values.

  5. Nearly Normal Condition & Outlier Condition: After re-expressing the response variable by taking its log, there are no extreme outliers. The predictor variables also have no outstanding outliers.
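The “Does the plot thicken?” check in this report is visual, but a rough numeric companion is possible. The sketch below is an illustration under stated assumptions, not the method used here: it sorts the residuals by predicted value, splits them into a lower and an upper half, and compares the spread of the two halves. The function name `spread_ratio` and the toy residuals are hypothetical.

```python
import statistics

def spread_ratio(fitted, residuals):
    """Residual spread in the upper half of fitted values divided by
    the spread in the lower half.

    Values near 1 suggest roughly constant spread; values far from 1
    suggest a fan shape.
    """
    # Order the residuals by their fitted (predicted) value.
    ordered = [r for _, r in sorted(zip(fitted, residuals))]
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[half:]
    return statistics.pstdev(upper) / statistics.pstdev(lower)

# Toy placeholder values (NOT residuals from the NBA model).
fitted = [1, 2, 3, 4, 5, 6, 7, 8]
even_resid = [-1, 1, -1, 1, -1, 1, -1, 1]
fan_resid = [-1, 1, -1, 1, -3, 3, -3, 3]

print(spread_ratio(fitted, even_resid))  # constant spread
print(spread_ratio(fitted, fan_resid))   # spread grows with the fitted value
```

A single ratio is a coarse summary and does not replace looking at the residual plot; it only gives a quick flag when the spread clearly grows or shrinks with the predicted value.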

Thus, we can use the multiple regression model for further interpretation and prediction.

log(Twitter Followers in Millions) = -9.53 + 0.19 × Age + 0.16 × Points Per Game + 1.47 × Winning Percent

| Variable | Coefficient |
| --- | --- |
| (Intercept) | -9.53112 |
| Age | 0.18622 |
| Points | 0.15657 |
| Winning % | 1.46670 |
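To make the fitted model concrete, the sketch below evaluates it with coefficients rounded from the coefficient table (-9.53, 0.19, 0.16, 1.47). It computes how the prediction changes when one predictor moves from its first to its third quartile while the other two stay at their means, using the quartiles and means from the summary table. Two points are assumptions rather than statements from the analysis: the names `predict_log_followers` and `iqr_effect` are hypothetical, and the back-transformation uses base 10, chosen because it reproduces the changes in followers reported under “Interpreting coefficients” (about 0.128, 0.626 and 0.056 million).

```python
# Coefficients rounded from the coefficient table.
INTERCEPT = -9.53
COEF = {"AGE": 0.19, "PTS_PG": 0.16, "W_PCT": 1.47}

# Mean, Q1 and Q3 of each predictor, from the summary table.
MEAN = {"AGE": 27.41, "PTS_PG": 15.21, "W_PCT": 0.509}
Q1 = {"AGE": 24.00, "PTS_PG": 9.40, "W_PCT": 0.417}
Q3 = {"AGE": 29.00, "PTS_PG": 21.10, "W_PCT": 0.630}

# ASSUMPTION: base 10, inferred from the reported back-transformed values.
LOG_BASE = 10

def predict_log_followers(values):
    """Predicted log(Twitter followers in millions) for a dict of predictors."""
    return INTERCEPT + sum(COEF[name] * values[name] for name in COEF)

def iqr_effect(name):
    """Change in the prediction when `name` moves from Q1 to Q3,
    with the other predictors held at their means.

    Returns (change in log(y), change in y in millions of followers).
    """
    low = predict_log_followers({**MEAN, name: Q1[name]})
    high = predict_log_followers({**MEAN, name: Q3[name]})
    return high - low, LOG_BASE ** high - LOG_BASE ** low

for name in COEF:
    d_log, d_followers = iqr_effect(name)
    print(f"{name}: change in log(y) = {d_log:.3f}, "
          f"change in y = {d_followers:.3f} million")
```

Because the model is linear in log(y) rather than in y, the change in followers for a given predictor depends on where the other predictors are held; the figures here hold them at their means.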

[Figure: 3D scatter plot of the multiple regression model]

Interpreting coefficients

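Predicted change in popularity when each predictor moves from its first quartile (Q1) to its third quartile (Q3), with the other two predictors held at their means:

| Quantity | AGE | PTS_PG | W_PCT |
| --- | --- | --- | --- |
| log(y) with predictor at Q3 | -0.83817 | -0.19787 | -0.9624 |
| log(y) with predictor at Q1 | -1.78817 | -2.06987 | -1.27551 |
| Change in log(y) | 0.95 | 1.872 | 0.313 |
| Change in y (millions of followers) | 0.128 | 0.62554 | 0.056 |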

Conclusion

In conclusion, the predictor variables age, average points per game and the team’s winning percentage all have meaningful positive associations with the player’s number of Twitter followers. Of the three, points per game has the strongest effect on the response variable, which suggests that personal skill is the most important of these three factors for popularity. One possible conclusion is therefore that our company should sign NBA players with higher average points per game, since they are estimated to be more popular on Twitter and to have greater marketing potential.

Another possible conclusion is that we should sign players who are not too young, since within the observed age range the estimated number of Twitter followers increases with age. However, we must be very careful about extrapolation here. Past a certain age, a player’s physical condition typically declines and his career goes downhill, so he may lose public attention. Previous studies put the peak age of an NBA player at around 27, so signing players who are in their prime, up to perhaps around 30 years old, would be a reasonable choice.

In addition, whether the player is on a competitive team is also related to his popularity on Twitter, although this association is weaker than that of his personal performance. We should therefore also take the team’s performance into account when signing a player.

Finally, we should weigh the cost of signing the player: a deal makes sense only if the potential marketing profit outweighs that cost.